Implement OpenMP offload for do group update#1782
Conversation
| do i = is, ie | ||
| pos = pos + 1 | ||
| field(i,j,k) = buffer(pos) | ||
| idx = pos + (k-1)*nj*ni + (j-js)*ni + (i-is) + 1 |
There was a problem hiding this comment.
How are these two implementations equivalent? Is new idx = old pos always?
There was a problem hiding this comment.
Yes, they're equivalent. For any iteration, idx = pos + (k-1)*nj*ni + (j-js)*ni + (i-is) + 1 produces the same value that pos would have had at that point. The formula accounts for all the iterations that would have occurred in the nested loops up to that (i,j,k) position.
The reason for the change is that each nested iteration is now independent and can be performed in parallel.
|
In 0cc2a77, compiler macros are replaced with the runtime OpenMP For example, below halo communication times doing a self-communication (re-entrant domain) in MOM6 barotropic, where packing and unpacking are on GPU: And with the CPU OpenMP directives not compiled via macros: Hence, 0cc2a77 will be reverted. |
bensonr
left a comment
There was a problem hiding this comment.
For some of the target omp statements, the if (use_device_ptr) is prior to the target data mapping and other times after. Does it matter where in the !$omp specification the if-test occurs?
The preference for !$omp is default(none) with full specification of shared and private.
AFAIK clause order is not important. I tend to prefer ending with (Also note that some of the
👍 |
|
@edoyango Could you update this with main when you get a chance? Looks like its showing some conflicts that we should probably clear up before we get this reviewed. |
Hi @rem1776 i've tried rebasing on top of main locally, but nvfortran still can't build main because of the use of |
Sorry about that! Thought the fix (#1873) was merged but looks like its still waiting for a review. It should get merged sometime this week, I'll let you know. |
* add multi gpu support * address review comments, add helpful comment for the acc/mp runbtime call
To enable this, had to be removed - otherwise segfaults happen on the GPU.
This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause.
Some very minor changes to the OpenMP target MPI PR: * use_device_ptr -> use_device_addr This appears to be the updated form (or at least nvfortran says it is) * Whitespace added to `!$ use omp_lib` Does not seem crucial but from our previous discussion it appears more correct. * Removal of some trailing whitespace.
This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines.
The no-comm (no MPI) interface has been updated to support the new omp_offload argument.
This ensures that (un)packing steps in do_group_update is performed with openmp cpu parallelism if ompoffload=.false.. Previously it would only do serial. This is implemented by undefining the GPU macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the (un)packing files. To make this work, the default(shared) was used in all the relevant OpenMP directives. If default(none) is used, the loops would hang or segfault.
Did't use the if(use_device_ptr) omp clause because the tests would hang. Instead made explicit branches. Left out mappings because gcc didn't like cray pointers in maps
|
@rem1776 ok i've rebased on top of main as well as your select rank workaround and kept my tests. I did notice that main + your workaround produces garbled terminal output with nvhpc though... e.g. this is MOM6 standalone output: |
Update test_mpp_domains to pass the OpenMP offload flag through the group update path and stage test data on the device for offloaded runs. Add explicit OpenMP maps and shared clauses for the offloaded setup and reference-copy loops, including shifted CGRID and BGRID extents. Unskip the group update OpenMP offload test in test_mpp_domains2.sh.
Commit 22dfe08 redirected the mpp_error_basic label into a new text_errortype variable but only copied it into the printed `text` inside the npes>1 branch. On single-PE runs `text` was left uninitialized, so mpp_error emitted garbled stack memory. Seed `text` from text_errortype on the npes==1 path.
this turned out to be an issue with an uninitialized msg variable being printed when |
|
@edoyango Thanks for putting that mpp_error fix in! i think i was seeing something similar with gcc but i like your fix better. I'm going it mark this PR as ready for review so we can try to include it in our next tag. |

Description
This adds the necessary code to make mpp_do_group_update work with arrays that are managed by NVIDIA's OpenMP offload runtime. This attempts to be minimally disruptive in that non-nvidia compilers will see the same behaviour as previously by adding macros around the relevant openmp directives.
Fixes #1771
How Has This Been Tested?
The OpenMP offload capability is currently tested on the "double gyre" case in MOM6-examples using the
nvfortrancompiler and a cuda-aware openmpi. We have some notes on how to run the gpu-enabled MOM6, but is outdated.Checklist:
make distcheckpasses